Zhu Lin
Climate change and potential global warming have become significant issues of the day; monitoring temperatures across a region is therefore increasingly important for responding to weather-related disasters.
The data files US_weather_2015 and weather_stations contain, respectively, temperatures recorded at stations across the United States and the stations' locations (longitude, latitude and elevation) from January 1 to December 31, 2015, a period of 365 days.
The following packages are used in this project.
library(readr)
library(dplyr)
library(ggplot2)
library(plotly)
library(maps)
library(mapdata)
library(maptools)
library(graphics)
library(gstat)
library(sp)
library(viridis)
library(fields)
library(gridExtra)
library(xts)
library(spacetime)
Read the two data files US_weather_2015 and weather_stations.
load("weather_data.RData")
load("weather_stations.RData")
The dataframe weather_data has five variables: stn, year, month, day and temp.
head(weather_data)
## stn year month day temp
## 1 916520 2015 1 1 83.0
## 2 916700 2015 1 1 86.9
## 3 788460 2015 1 1 81.8
## 4 916600 2015 1 1 84.3
## 5 965650 2015 1 1 79.0
## 6 406370 2015 1 1 47.6
The dataframe weather_stations has seven variables: stn, name, country, state, lat, lon and elev.
head(weather_stations)
## stn name country state lat lon elev
## 1 423630 MISSISSIPPI CANYON OIL PLATFO US LA 28.160 -89.220 37
## 2 619760 SERGE-FROLOW (ILE TROMELIN) US -15.883 54.517 13
## 3 621010 MOORED BUOY US 50.600 -2.933 -999
## 4 621110 MOORED BUOY US 58.900 -0.200 -999
## 5 621130 MOORED BUOY US 58.400 0.300 -999
## 6 621140 MOORED BUOY US 58.700 1.100 -999
We can get a rough idea of the monthly temperature in the United States from the histogram and boxplot.
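The histogram and boxplot can be produced with ggplot2 and gridExtra, which are already loaded. The code below is a minimal sketch, assuming weather_data has the month and temp columns shown in the head(weather_data) output above.

```r
# Sketch only: assumes weather_data (columns month, temp) is loaded as above
library(ggplot2)
library(gridExtra)

p_hist <- ggplot(weather_data, aes(x = temp)) +
  geom_histogram(binwidth = 5, fill = "steelblue", colour = "white") +
  labs(x = "Temperature (Fahrenheit)", y = "Count")

p_box <- ggplot(weather_data, aes(x = factor(month), y = temp)) +
  geom_boxplot() +
  labs(x = "Month", y = "Temperature (Fahrenheit)")

grid.arrange(p_hist, p_box, nrow = 2)
```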
To familiarize ourselves with the geography of the dataset, we initially ignore its temporal component and examine the spatial distribution of temperatures on a single day. Taking day 1 as an example, we extract the data for January 1, 2015.
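The day1 dataframe printed below can be obtained by filtering the first day and joining the station coordinates. This is a sketch under the assumption that the column names match the earlier head() outputs; the construction of the num column is not shown here.

```r
# Sketch: assumes weather_data and weather_stations are loaded as above
library(dplyr)

day1 <- weather_data %>%
  filter(month == 1, day == 1) %>%                        # keep January 1, 2015
  inner_join(weather_stations %>% select(stn, lon, lat, elev),
             by = "stn")                                  # attach coordinates
```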
## Extract day 1's data and sort it
head(day1)
## stn year month day temp lon lat elev num
## 1 722212 2015 1 1 59.6 -81.340 29.959 3.1 1
## 2 722250 2015 1 1 41.1 -85.000 32.333 71.0 1
## 3 725377 2015 1 1 21.8 -82.833 42.600 177.0 1
## 4 746710 2015 1 1 29.2 -87.483 36.667 174.7 1
## 5 720394 2015 1 1 36.4 -93.066 34.100 55.8 1
## 6 722251 2015 1 1 37.2 -92.588 32.514 94.8 1
Since the temperature observations are discrete in space (in a longitude-latitude grid, only 0.02% of the cells are not NA), the image.plot command in the fields package cannot be used directly to display United States temperatures. We therefore introduce the following method to plot the temperature distribution of the discrete points.
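One such method, sketched below with ggplot2 (assuming day1 as constructed above), is to draw each station as a point coloured by its temperature over a US basemap:

```r
# Sketch: plot discrete station temperatures instead of a gridded image
library(ggplot2)
library(maps)
library(viridis)

us_map <- map_data("state")  # mainland US polygons from the maps package

ggplot() +
  geom_polygon(data = us_map, aes(long, lat, group = group),
               fill = "grey95", colour = "grey70") +
  geom_point(data = day1, aes(lon, lat, colour = temp), size = 1.5) +
  scale_colour_viridis(name = "Temp (F)") +
  coord_quickmap(xlim = c(-125, -65), ylim = c(25, 50)) +
  labs(x = "Longitude", y = "Latitude")
```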
A pronounced temperature gradient is visible, from highs of over 85.6 Fahrenheit in the south of the study area, which is nearer the equator, to a low of -14.7 Fahrenheit towards the northern boundary. This is not only indicative of spatial correlation in the dataset; it also shows that the data are not stationary, as the mean temperature varies strongly with latitude.
According to the temperature distribution image for January 1, 2015, temperatures in most parts of the United States were concentrated around 30 Fahrenheit, and extreme temperatures accounted for only a small proportion.
Longitude, latitude, elevation and the corresponding temperatures were extracted from day1; 80% of the records formed the training set and the remainder the test set.
set.seed(123)
index = sample(nrow(Y_new),nrow(Y_new)*0.8)
train_X = X_new[index,]
test_X = X_new[-index,]
train_y = as.matrix(Y_new)[index]
test_y = as.matrix(Y_new)[-index]
train_Z = Z_new[index]
test_Z = Z_new[-index]
library(fields)
par_est = spatialProcess(train_X,train_y,Z=train_Z,
mKrig.args = list(m = 3),
Distance = "rdist.earth",
cov.args = list(Covariance = "Matern",
smoothness = 1))
# where m=3 means quadratic form
## Call:
## spatialProcess(x = train_X, y = train_y, Z = train_Z, mKrig.args = list(m = 3),
## cov.args = list(Covariance = "Matern", smoothness = 1), Distance = "rdist.earth")
##
## Number of Observations: 1850
## Degree of polynomial in fixed part: 2
## Total number of parameters in fixed part: 7
## Number of additional covariates (Z) 1
## MLE nugget variance ( sigma^2) 4.565
## MLE process variance (rho) 60.86
## MLE range parameter (theta, units of distance): 157.1
## Approx 95% lower bound: 131
## upper bound: 207.8
## Approx. degrees of freedom for curve 609.2
## Standard Error of df estimate: 7.066
## Nonzero entries in covariance 3422500
##
## Covariance Model: stationary.cov
## Covariance function: Matern
## Non-default covariance arguments and their values
## Argument: Covariance has the value(s):
## [1] "Matern"
## Argument: smoothness has the value(s):
## [1] 1
## Argument: Distance has the value(s):
## [1] "rdist.earth"
## Argument: theta has the value(s):
## theta
## 157.1435
## Argument: onlyUpper has the value(s):
## [1] FALSE
## Argument: distMat has the value(s):
## [1] NA
par_est2 = spatialProcess(train_X,train_y,Z=train_Z,
mKrig.args = list(m = 2),
Distance = "rdist.earth",
cov.args = list(Covariance = "Matern",
smoothness = 1))
# where m=2 means linear form
## Call:
## spatialProcess(x = train_X, y = train_y, Z = train_Z, mKrig.args = list(m = 2),
## cov.args = list(Covariance = "Matern", smoothness = 1), Distance = "rdist.earth")
##
## Number of Observations: 1850
## Degree of polynomial in fixed part: 1
## Total number of parameters in fixed part: 4
## Number of additional covariates (Z) 1
## MLE nugget variance ( sigma^2) 4.733
## MLE process variance (rho) 139.8
## MLE range parameter (theta, units of distance): 257.7
## Approx 95% lower bound: 198.8
## upper bound: 384
## Approx. degrees of freedom for curve 571.9
## Standard Error of df estimate: 6.899
## Nonzero entries in covariance 3422500
##
## Covariance Model: stationary.cov
## Covariance function: Matern
## Non-default covariance arguments and their values
## Argument: Covariance has the value(s):
## [1] "Matern"
## Argument: smoothness has the value(s):
## [1] 1
## Argument: Distance has the value(s):
## [1] "rdist.earth"
## Argument: theta has the value(s):
## theta
## 257.6591
## Argument: onlyUpper has the value(s):
## [1] FALSE
## Argument: distMat has the value(s):
## [1] NA
For the quadratic polynomial trend:
par_est$d
## [,1]
## [1,] 566.58554323
## [2,] 6.22423087
## [3,] -10.68986435
## [4,] 0.02730697
## [5,] -0.02599832
## [6,] 0.07907762
## [7,] -0.00386215
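Assuming fields' default ordering of the polynomial terms (an assumption worth verifying against fields::fields.mkpoly), the seven entries above would correspond to the fixed part

\[ \mu(s) = \beta_1 + \beta_2\,\text{lon} + \beta_3\,\text{lat} + \beta_4\,\text{lon}^2 + \beta_5\,\text{lon}\,\text{lat} + \beta_6\,\text{lat}^2 + \beta_7\,\text{elev}, \]

with the last entry, \(\beta_7 \approx -0.0039\), being the coefficient of the covariate Z (elevation).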
The coefficient of Z is so small that it may be ignored.
par_est$eff.df
## [1] 609.1969
par_est$lnProfileLike
## [1] -4691.364
pre = predict(par_est, xnew = test_X, Z = test_Z)
RMSE = mean((pre-test_y)^2)/median(test_y)  # note: MSE scaled by the median, not a true root mean squared error
RMSE
## [1] 0.1412453
For the linear polynomial trend:
par_est2$d
## [,1]
## [1,] 105.311988795
## [2,] -0.154836179
## [3,] -2.111269961
## [4,] -0.003804903
The coefficient of Z is so small that it may be ignored.
par_est2$eff.df
## [1] 571.8559
par_est2$lnProfileLike
## [1] -4702.465
pre2 = predict(par_est2, xnew = test_X, Z = test_Z)
RMSE2 = mean((pre2-test_y)^2)/median(test_y)  # same scaled-MSE measure as above
RMSE2
## [1] 0.1472223
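Note that the error measure computed above is the mean squared error scaled by the median of the test response, rather than a root mean squared error. A self-contained illustration of the difference, with made-up numbers:

```r
# Toy example: true RMSE vs the median-scaled MSE used in this report
pred <- c(10, 12, 14)
obs  <- c(11, 12, 13)

mse      <- mean((pred - obs)^2)   # 2/3
rmse     <- sqrt(mse)              # root mean squared error
norm_mse <- mse / median(obs)      # the quantity labelled RMSE above

print(c(rmse = rmse, norm_mse = norm_mse))
```

Both measures rank the two models identically here, so the model comparison below is unaffected by the naming.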
From the output above, the log profile likelihood for lambda in model 1 is larger than in model 2, implying a better fit for the former. Furthermore, the prediction error of model 1 is smaller than that of model 2. In view of goodness of fit and interpretability, together with the smaller prediction error, we choose model 1. The covariance model could also be selected by comparison, but the candidate models show no obvious difference in prediction, so the Matérn model with a flexible smoothness parameter is used to fit the whole data set.
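For reference, the Matérn covariance used throughout, with smoothness \(\nu = 1\), process variance \(\rho\) and range \(\theta\), is

\[ C(d) = \rho\,\frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{d}{\theta}\right)^{\nu} K_{\nu}\!\left(\frac{d}{\theta}\right), \]

where \(K_{\nu}\) is the modified Bessel function of the second kind and \(d\) is the great-circle distance computed by rdist.earth.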
## parameter estimation
#(fields uses great-circle distances, approximating the earth by a sphere)
par_est_whole = spatialProcess(train_X,train_y,Z=train_Z,
mKrig.args = list(m = 3),
Distance = "rdist.earth",
cov.args = list(Covariance = "Matern",smoothness = 1))
# where m=3 means quadratic form
print(par_est_whole)
sigma.square = as.numeric(par_est_whole$sigma.MLE)^2 # nugget variance (sigma^2) 4.225
rho = as.numeric(par_est_whole$rho.MLE) # process variance (rho) 60.12
theta = as.numeric(par_est_whole$theta.MLE) # range parameter (theta) 150.646 (miles)
## Create testing data on a meshgrid with resolution (60/50)° x (25/50)° in longitude and latitude
lon_num = 50
lat_num = 50
lon_seq = seq(-125,-65,length.out=lon_num)
lat_seq = seq(25,50,length.out=lat_num)
## Restrict the grid to the USA mainland
require(mapdata)
tmp <- map('worldHires', 'Usa', fill=TRUE, plot=FALSE)
require(maptools)
US.boundary <- map2SpatialPolygons(tmp, IDs = tmp$names, proj4string = CRS(as.character(NA)))
US.bbox.grid <- SpatialPoints(cbind(rep(lon_seq,length(lat_seq)),
rep(lat_seq,each=length(lon_seq))))
gridded(US.bbox.grid) <- TRUE
US.grid <- US.bbox.grid[!is.na(over(US.bbox.grid, US.boundary)),]
Z = as.matrix(rep(mean(Z_new),length(US.grid@coords[,1])))
pre_whole = predict(par_est_whole, xnew = US.grid@coords, Z = Z) # Z is not necessary
range(par_est_whole$residuals)
## [1] -13.14837 18.24184
The range of the fitted residuals is \((-13.14837, 18.24184)\), and the bubble diagram reflects the overall fit of the spatial model with the quadratic polynomial trend and Matérn covariance; only a few locations show a large deviation in the predicted value, which could be a result of high elevation.
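The bubble diagram of the residuals can be drawn with sp::bubble. The following is a sketch, assuming the first two columns of train_X hold the longitude and latitude used to fit par_est_whole.

```r
# Sketch: bubble plot of fitted residuals at the training locations
library(sp)

resid_df <- data.frame(lon = train_X[, 1], lat = train_X[, 2],
                       residual = par_est_whole$residuals)
coordinates(resid_df) <- ~ lon + lat   # promote to SpatialPointsDataFrame
bubble(resid_df, "residual", main = "Fitted residuals")
```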